Built with R version 4.4.0.

1 Introduction

The purpose of this document is to make open and share-able the research methods used for my dissertation on nearshore fish communities in Alaska working towards a PhD in marine biology at the University of Alaska Fairbanks. Here, I cover steps of the data preparation for my second and third chapters concerning spatial and temporal distributions of nearshore fishes across the state. This and other files can be accessed via the Kachemak Bay National Estuarine Research Reserve’s github nearshore fish repository.

In this document I share exploratory data analyses conducted on the NOAA Nearshore Fish Atlas (NFA) database. The NFA database can be found here, https://alaskafisheries.noaa.gov/mapping/sz/. Namely, this document contains the steps taken after cleaning/wrangling raw data. In particular I produce various visualizations of the data in space and time. Initial wrangle steps can be viewed at this Rpubs page and its data objects are sourced below (file “NFA.rda”).

1.1 Set up

Load required packages, define directory, set options, source/load files:

# Packages
library(tidyverse)
library(lubridate)
library(here)
library(leaflet)
library(RColorBrewer)

# Directory
wd = here()
dirs = wd %>% list.files() %>% str_subset(pattern = "^README|^LICENSE|.md$|.Rproj$", negate = TRUE)
for (i in seq_along(dirs)) {
  name = str_replace_all(dirs[i], "^", "dir.")
  path = str_replace_all(dirs[i], "^", str_c(wd, "/"))
  assign(name, path)
  rm(name, path, i)
}

# Options

# Source/Load
load(file.path(dir.data, "NFA_wrangled.rda")) # wrangled data

1.2 Background and objectives

Based on my own research and related literature, I can broadly interpret these beach seine data in a spatiotemporal context. Depending on when (time of year) and where (location and habitat), I have a general structure in mind of the diversity of the community. Depending on who’s there, I have a rough guess of the relative abundance of each community member (at least for the more commonly caught ones). My hope with working with the NFA data is that this broad understanding can be formally tested with inference and statistics, and that more informative research questions can be answered.

Findings from other nearshore fish researchers largely agree that seasonality (or some related environmental condition) is a strong predictor of community or species presence. Interannual differences can exhibit wide variability and should be accounted for whenever possible. And at larger time-scales (e.g., decadal), communities exhibit changes which may relate to regime shifts occurring more broadly than the nearshore.

Spatially, large-scale (regional) effects appear to be as or more important than sub-regional or local-scale effects. However, many studies also find significant relationships in community response at these smaller scales. Likely, the most appropriate spatial consideration depends on the question being asked. So maybe a more appropriate question to ask of the NFA data is if there any significant spatial scales discernible in the data, and also how should we address those scales in future research, such as studies on subsets of the taxa or habitat management considerations.

I’ll start by exploring the structure of our ‘visits’ dataframe, and then I’ll move on to visualizing the information derived from out ‘catch’ dataframe.

2 Exploring visit data

2.1 Summary of visits

Remember that in the data wrangling stage, we created a new sample identifier based on site and date which we called VisitID. Let’s take a look at the variables associated with each VisitID:

glimpse(visits)
## Rows: 1,867
## Columns: 7
## $ VisitID    <chr> "1_2001-07-24", "1_2002-03-25", "1_2001-04-26", "1_2002-07-…
## $ Replicates <int> 5, 7, 4, 4, 5, 3, 4, 2, 6, 1, 2, 2, 6, 4, 4, 4, 4, 3, 2, 2,…
## $ Date       <date> 2001-07-24, 2002-03-25, 2001-04-26, 2002-07-24, 2003-03-22…
## $ Lat        <dbl> 58.56626, 58.56296, 58.56408, 58.56755, 58.55804, 58.56858,…
## $ Lon        <dbl> -134.9105, -134.9082, -134.9137, -134.9133, -134.9102, -134…
## $ MeshSize   <dbl> 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2,…
## $ Cluster    <fct> 327, 327, 327, 327, 327, 327, 327, 327, 327, 327, 327, 326,…

During the wrangling stage, I decided that Replicates and MeshSize should be the main covariates that needed consideration. At this point they are classed as integers/numbers, but we will probably want them to be factors instead. Cluster is a byproduct of clustering some sites together, so it may not be needed if we’re using Lat/Lon as our primary explanatory variable. Still, it could be a useful grouping factor later on. Date will likely be mutated into multiple additional periods (day of year, month, year) to see if any of those factors are useful in explaining fish data. We’ll address edits to these variables as they come up in our exploration.

2.2 Spatial distribution

leaflet(visits) %>% 
  addTiles(options = tileOptions(minZoom = 3.5,
                                 zIndex = 0.5)) %>%
  setView(lat = 65, lng = -152, zoom = 3.5) %>%
  addSimpleGraticule(interval = 5) %>%  
  addCircleMarkers(lng = ~Lon, lat = ~Lat,
                   stroke = FALSE,
                   popup = ~VisitID,
                   clusterOptions = markerClusterOptions())

In the map above we can view all of our samples from across Alaska. Leaflet options are nice because you can optionally group samples on the map and show a boundary polygon of the area they cover when you hover over their icon. At the minimum zoom level (3.5), we see the largest group spanning nearly all of Southeast Alaska, followed by a group in Southcentral that covers Prince William Sound, Cook Inlet, and the northern half of Kodiak Island. The third largest group forms a much smaller polygon area around Utqiagvik. From there, we see even smaller groups from the Eastern Beaufort Sea around Kaktovik, the Bering Strait and Southern Chukchi Sea near Kotzebue, and Bristol Bay. There are also two groups from the Alaska Peninsula and Aleutian Islands, one focused around Adak and another spanning from Sand Point to Unalaska. We can see how subsets of the samples change by zooming in. When there are single samples in view, hovering over those will tell us its VisitID name.

Obviously, this visualization of the data is not perfect and pretty rough- note that the groups are formed by the overlap of samples based on pixel radius (10). For example, the polygon of samples from Bristol Bay at the minimum zoom level actually crosses the Alaska Peninsula to include a sample from Aghiyuk Island. Although, the map is useful in gaining a general understanding of the existing clusters of samples within the NFA database.

2.3 Regions

We may want to define a spatial factor based on these rough groupings as a starting place to see if the smaller, more isolated groups should be lumped with other groups or not. Actually, the NFA already has a Region classifier per site, but the reason I do not want to use it is because they linked Region to SiteID. When we created our new VisitID, some samples now contained two Region labels- these were mostly cases from around Utqiagvik where some were labelled as Chuckchi and others as Beaufort but actually occurred on the same day and within a very short distance of each other.

Instead of using the the given regional classes, let’s just make our own variable called ‘Region’ and add it to the our visits df:

# Start with events so that we can make use of Location info for unique cases,
regions = events %>%
  select(EventID, VisitID, Lat, Lon, Location) %>%
  mutate(Region = case_when(Lat > 69 & Lat < 80 & Lon > -150 & Lon < -140 ~ "Beaufort East",
                            Lat > 69 & Lat < 80 & Lon > -160 & Lon < -150 ~ "Chukchi/Beaufort",
                            Lat > 65 & Lat < 69 & Lon > -170 & Lon < -160 ~ "Chukchi South",
                            Lat > 57 & Lat < 60 & Lon > -165 & Lon < -155 ~ "Bristol Bay",
                            Lat > 50 & Lat < 55 & Lon > -180 & Lon < -175 ~ "Aleutians Adak",
                            Lat > 50 & Lat < 55 & Lon > -170 & Lon < -165 ~ "Aleutians Unalaska",
                            Lat > 55 & Lat < 65 & Lon > -155 & Lon < -145 ~ "GOA Southcentral",
                            Lat > 53 & Lat < 61 & Lon > -148 & Lon < -130 ~ "GOA Southeast"))

# Check which samples we missed,
filter(regions, is.na(Region))
## # A tibble: 7 × 6
##   EventID VisitID           Lat   Lon Location                            Region
##     <dbl> <chr>           <dbl> <dbl> <chr>                               <chr> 
## 1    6635 1564_2006-06-22  52.7 -171. Yunaska Island                      <NA>  
## 2    6636 1565_2006-07-26  56.2 -157. Semidi Islands, Aghiyuk Island      <NA>  
## 3    6637 1566_2006-07-31  55.1 -160. Shumagin Islands, Bendel Island     <NA>  
## 4    6638 1566_2006-07-31  55.1 -160. Shumagin Islands, Bendel Island     <NA>  
## 5    6639 1568_2006-07-31  55.1 -160. Shumagin Islands, Spectacle Island  <NA>  
## 6    6640 1569_2006-07-31  55.1 -160. Shumagin Islands, Nagai Island, Mi… <NA>  
## 7    6641 1569_2006-07-31  55.1 -160. Shumagin Islands, Nagai Island, Mi… <NA>
# Let's address those using Location,
regions = mutate(regions,
       Region = case_when(!is.na(Region) ~ Region,
                          is.na(Region) & str_detect(Location, "Yunaska") ~ "Aleutians Yunaska",
                          is.na(Region) & str_detect(Location, "Aghiyuk") ~ "Aleutians Aghiyuk",
                          is.na(Region) & str_detect(Location, "Shumagin") ~ "Aleutians Shumagin"))

# Check again (should be 0),
filter(regions, is.na(Region)) %>% nrow()
## [1] 0

Now let’s add Region as a factor to our visits df, and show it on a map. Since we’re manipulating our data objects again, let’s also rename visits so that we can keep track of our changes.

# Call our wrangled df version visits.0 and remove the non-version 'visits' if it's there,
if ("visits" %in% ls()) assign("visits.0", visits)
if ("visits" %in% ls()) rm(visits)

# Make Region a factor and join by VisitID,
visits.1 = regions %>%
  select(VisitID, Region) %>%
  distinct() %>%
  mutate(Region = as.factor(Region)) %>%
  left_join(visits.0, ., by = "VisitID")

# Function for regional palette
palette.region = colorFactor(palette = brewer.pal(n = visits.1$Region %>% n_distinct(),
                                                  name = 'Spectral'),
                             domain = factor(visits.1$Region))

# Map
leaflet(visits.1) %>% 
  addTiles() %>%
  addCircleMarkers(lng = ~Lon, lat = ~Lat,
                   color = ~palette.region(Region),
                   fillColor = ~palette.region(Region),
                   popup = ~paste(Region, VisitID),
                   stroke = FALSE) %>%
  addLegend(position = "topright",
            pal = palette.region, values = ~Region,
            title = "Region",
            opacity = 0.75)

Like we saw before, the majority of our samples are located in the Gulf of Alaska (GOA), namely Southeast (SEAk) and Southcentral (SCAK). If we were to look at differences in catach in the GOA, it’d be cool if we could include the Aleutian samples too. Obviously, we’ll want to address those few instances from Aghiyuk, Shumagin, and Yunaska, and ideally we aggregate them into one or two groups. We also see relatively small sample sizes from Bristol Bay and from East Beaufort, which may or may not be prohibitive in conducting meaningful tests. One other consideration is the couple of samples along the Bering Strait coast which are lumped together with a larger concentration of samples from South Chukchi.

Some of our first analyses should see if the more isolated samples are different from the other samples. If not, then we can aggregate. If yes, then we should report any significant tests.

2.4 Temporal distribution

Next we’ll want to take a look at how our samples distribute throughout various time periods. This will give us a good idea of how well our samples may or may not overlap- particularly, we’ll want to see overlaps in years and months for better comparisons of the catch data.

# First create a df specifically containing at time variables
times = visits.1 %>%
  select(VisitID, Date) %>%
  mutate(Year = year(Date),
         Month = month(Date, label = TRUE),
         Week = week(Date),
         Day = yday(Date))

# Year
times %>%
  count(Year) %>%
  ggplot(., aes(x = Year, y = n)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.5, color = "blue") +
  labs(x = NULL, y = NULL, title = "No. samples per Year")

# Year without 1976
times %>%
  filter(Year > 1980) %>%
  count(Year) %>%
  ggplot(., aes(x = Year, y = n)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.5, color = "blue") +
  labs(x = NULL, y = NULL, title = "No. samples per Year without 1976")

# Month
times %>%
  count(Month) %>%
  ggplot(., aes(x = Month, y = n)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.5, color = "blue") +
  labs(x = NULL, y = NULL, title = "No. samples per Month")

# Week
times %>%
  count(Week) %>%
  ggplot(., aes(x = Week, y = n)) +
  geom_col() +
  geom_text(aes(label = n), hjust = -0.15, angle = 90, color = "blue") +
  labs(x = NULL, y = NULL, title = "No. samples per Week")

# Day of Year
times %>%
  count(Day) %>%
  distinct() %>%
  ggplot(., aes(x = Day, y = n)) +
  geom_col() +
  labs(x = NULL, y = NULL, title = "No. samples per Day of Year")

By year, we see a large variability in number of samples, as well as a lone 1976 year which I recognize to be Blackburn’s dataset from Lower Cook Inlet. That 1976 is an interesting dataset that could be useful for a look at catches over time in that specific area, but it’ll need to be excluded from any other comparisons. The lowest number of samples occur in 2010, 2020, and 2021. We may want to pay close attention to these years in case they show up as outliers. It seems that 1996-2000 is a period of heavy sampling, followed by the mid 2010s and somewhat active year in 2006.

By month, we see an obvious uptick during the summer which makes sense given winter conditions along most of Alaska’s coast (light, ice, etc.). March and October are interesting months because of the potential for impacts on fishes during years with warmer shoulder seasons. November thru February will likely be removed for our purposes. Our key months look to be June, July and August, with possible inclusion of April, May, and September.

It’s interesting if we look at the by-week and by-day graphs, we see a dip in sampling around mid-June/early-July. Not sure why this is- maybe related to more targeting of juvenile salmonids in early June, or maybe so reduced effort around the 4th of July holiday.

2.5 Year & month by region

Let’s now see how our samples overlap by Region and Year, and by Region and Month. For these purposes, let’s combine the Aleutians into one group for now, and remove the 1976 data.

# Samples by Year and Region,
left_join(filter(times, Year > 1980),
          select(visits.1, VisitID, Region), by = "VisitID") %>%
  mutate(Region = case_when(str_detect(Region, "Aleutians") ~ "Aleutians",
                            !str_detect(Region, "Aleutians") ~ Region)) %>%
  ggplot(aes(x = Date, y = Region, color = Region)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -1.9, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               }) +
  scale_color_brewer(palette = "Spectral") +
  scale_x_date(date_labels = "%y",
               date_breaks = "2 years") +
  labs(x = "Year", y = "Region",
       title = "Samples by Year and Region",
       subtitle = "No. samples centered over median") +
  theme_minimal() +
  theme(legend.position = "none")

# Samples by Month and Region,
left_join(filter(times, Year > 1980),
          select(visits.1, VisitID, Region), by = "VisitID") %>%
  mutate(Region = case_when(str_detect(Region, "Aleutians") ~ "Aleutians",
                            !str_detect(Region, "Aleutians") ~ Region)) %>%
  ggplot(aes(x = Month, y = Region, color = Region)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -1.9, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               })+
  scale_color_brewer(palette = "Spectral") +
  labs(x = "Month", y = "Region",
       title = "Samples by Month and Region",
       subtitle = "Excluding 1976; no. samples labelled by month") +
  theme_minimal() +
  theme(legend.position = "none")

Interesting that each region’s samples center on a different year. At a glance, I see the most overlap in 2006 where the Aleutians and both GOA regions look to have similar sample densities, plus there is a smattering of Chukchi/Beaufort sampling too. In 2009, the Bristol Bay samples may be compared with both GoA regions and the Chukchi/Beaufort, although the samples look more sparse across the regions. And in 2013, the Chukchi/Beaufort has good overlap with both GOA regions again. I see nice periods of comparison between SEAK and SCAK, but there is also a concerning difference in yearly distribution of samples between the two regions. Depending on how important year effects appear- we may need to trim the data for better comparisons.

By month, we see that August contains samples from all regions. June is the center for SEAK and the Aleutians, July for SCAK, and August for Chukchi/Beaufort. Best comparisons from this monthly view may be an August (or lumped July + August) comparison of all Regions. Also, a June (or lumped June thru August) comparison of GOA and the Aleutians would be good. And, we should be able to nicely compare SEAK and SCAK from May to September, potentially including April.

For now, we’ll hold off on excluding any of the data. It’ll be better to revisit these considerations after deciding which research questions to tackle, so subsets of the data can be made specific to each one.

2.6 Mesh Size and Replicates

Let’s make similar visuals as before, but now focus on Mesh Size and Replicates in space and time. First, we can re-map the samples to color by Mesh Size and size of points relative Replicates (large dots = more replicates).

# Palette function to color markers
palette.mesh = colorFactor(palette = brewer.pal(n = visits.1$MeshSize %>% n_distinct(),
                                                name = 'Spectral'),
                           domain = factor(visits.1$MeshSize))
# Map
leaflet(visits.1) %>% 
  addTiles() %>%
  addCircleMarkers(lng = ~Lon, lat = ~Lat,
                   radius = ~(sqrt(Replicates*15)),
                   color = ~palette.mesh(MeshSize),
                   fillColor = ~palette.mesh(MeshSize),
                   popup = ~VisitID,
                   stroke = FALSE) %>%
  addLegend(position = "bottomright",
            pal = palette.mesh, values = ~MeshSize,
            title = "Mesh Size (mm)",
            opacity = 0.75)

I have an issue discerning among the seven different sizes since those warmer colors do not diverge well. Although there are some patterns to make note of: the largest mesh size of 12.7 only occurs in a small area of SEAK, the 10 mm samples similarly only occur with the 12.7 mm samples (likely the Lundstrom et al. 2022 gear comparison), and the smallest size at 3 mm only occurs in Cook Inlet. I don’t see any obvious spatial patterns in Replicates, although this may not be the best view of it.

Let’s see the frequency of samples for both variables:

# Table view
table(visits.1$MeshSize)
## 
##    3  3.2    6  6.4  9.5   10 12.7 
##  436  821  286   35   82   59  148
table(visits.1$Replicates)
## 
##   1   2   3   4   5   6   7   8   9  10  12 
## 665 533 334 142  60  90  19  21   1   1   1

There are more 12.7 mm samples than I would have guessed which is a shame because if they are completely different then we lose a lot of samples. I am wondering if sizes are accurate- I would think that 3 mm and 3.2 mm are effectively similar in practice. I am also wondering if some projects report, say 3 mm instead of 3.2 mm, because of rounding. I would assume that the nets used were manufactured in the US, which would mean that specifications are given in standard not metric. Let’s do a quick check on how typical mesh sizes convert from inches to millimeters.

# Function based on inch to millimeter conversion
in_to_mm = function (x) {
  return(y = 25.4 * x)
}

# Typical mesh sizes in inches
sizes = c((1/8), (1/4), (3/8), (1/2), (5/8), (3/4))

# Convert to millimeters
in_to_mm(sizes)
## [1]  3.175  6.350  9.525 12.700 15.875 19.050

There is obviously some funny rounding happening in the data. I do not see an issue by lumping mesh sizes that are within one millimeter. Even if the nets were actually different by fractions of a millimeter, I don’t think it would affect the selectivity of fishes all that much- especially considering that most of these researchers were capturing fish to be measured to the nearest millimeter.

Let’s see that map again with this in mind:

# Palette function for binned sizes
palette.mesh.bin = colorBin(palette = brewer.pal(n = 4, name = 'Spectral'),
                            domain = visits.1$MeshSize %>% unique(),
                            bins =  c(0, 4, 7, 11, 13))
# Map
leaflet(visits.1) %>% 
  addTiles() %>%
  addCircleMarkers(lng = ~Lon, lat = ~Lat,
                   radius = ~(sqrt(Replicates*15)),
                   color = ~palette.mesh.bin(MeshSize),
                   fillColor = ~palette.mesh.bin(MeshSize),
                   popup = ~VisitID,
                   stroke = FALSE) %>%
  addLegend(position = "bottomright",
            colors = c(palette.mesh.bin(3),
                       palette.mesh.bin(6),
                       palette.mesh.bin(9),
                       palette.mesh.bin(12)),
            values = ~MeshSize,
            title = "Mesh Size (mm)",
            opacity = 0.75,
            labels = c("3 or 3.2", "6 or 6.4", "9.5 or 10", "12.7"))

Much easier to read! I see some concerning patterns we’ll want to investigate further as we go. The majority of samples appear to be using the 3.2 mm size, including both Arctic groups that seem to use it exclusively. There is a large density of 6.4 mm samples in Northern SEAK and throughout the Aleutians. The same issues are visible concerning the two larger mesh sizes. We should do a comparison of 3.2 mm and 6.4 mm to see if there are any significant differences in catch.

With the simpler color scheme, I am also noticing the difference in Replicates a little easier. Just looking at GOA and Aleutian samples, I see that many of the 6.4 mm samples are also low-Replicate samples. Although, many of our samples only contain one replicate so this may just be something I’m seeing and not actually a concern.

Let’s take a look at how these variables in graphical format. I’ll go ahead and aggregate similar mesh sizes and re-classify the both variables as factors, and add in variables for year and month.

visits.2 = mutate(visits.1,
                  Replicates = as_factor(Replicates),
                  # Combine similar mesh sizes to the tenth decimal
                  MeshSize = case_when(MeshSize < 4 ~ 3.2,
                                       MeshSize > 4 & MeshSize < 7 ~ 6.4,
                                       MeshSize > 7 & MeshSize < 11 ~ 9.5,
                                       MeshSize == 12.7 ~ 12.7) %>%
                    as.factor()) %>%
  # Add temporal variables
  left_join(select(times, VisitID, Year, Month), by = "VisitID")
# Samples by MeshSize and Year,
filter(visits.2, Year > 1989) %>%
  ggplot(aes(x = Date, y = MeshSize, color = MeshSize)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -3, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               }) +
  scale_color_brewer(palette = "Spectral") +
  scale_x_date(date_labels = "%y",
               date_breaks = "2 years") +
  labs(x = "Year", y = "Mesh Size (mm)",
       title = "Samples by Year and Mesh Size",
       subtitle = "Excluding 1976; no. samples centered over median") +
  theme_minimal() +
  theme(legend.position = "none")

# Samples by Month and MeshSize,
filter(visits.2, Year > 1989) %>%
  ggplot(aes(x = Month, y = MeshSize, color = MeshSize)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -3, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               })+
  scale_color_brewer(palette = "Spectral") +
  labs(x = "Month", y = "Mesh Size (mm)",
       title = "Samples by Month and Mesh Size",
       subtitle = "Excluding 1976; no. samples labelled by month") +
  theme_minimal() +
  theme(legend.position = "none")

# Samples by Replicates and Year,
filter(visits.2, Year > 1989) %>%
  ggplot(aes(x = Date, y = Replicates, color = Replicates)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -1, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               }) +
  scale_color_brewer(palette = "Spectral") +
  scale_x_date(date_labels = "%y",
               date_breaks = "2 years") +
  labs(x = "Year", y = "Replicates",
       title = "Samples by Year and Replicates",
       subtitle = "Excluding 1976; no. samples centered over median") +
  theme_minimal() +
  theme(legend.position = "none")

# Samples by Replicates and Month,
filter(visits.2, Year > 1989) %>%
  ggplot(aes(x = Month, y = Replicates, color = Replicates)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -1, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               })+
  scale_color_brewer(palette = "Spectral") +
  labs(x = "Month", y = "Replicates",
       title = "Samples by Month and Replicates",
       subtitle = "Excluding 1976; no. samples labelled by month") +
  theme_minimal() +
  theme(legend.position = "none")

---
title: "NOAA Nearshore Fish Atlas, Exploratory Data Analyses"
author: "Chris Guo"
date: "Last compiled on `r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: TRUE
    toc_depth: 2
    toc_float:
      collapsed: FALSE
      print: FALSE
    number_sections: TRUE
    code_download: TRUE
theme: "flatly"
---

```{r include = FALSE}
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(size = "scriptsize")
```

Built with R version `r getRversion()`.

# Introduction

The purpose of this document is to make open and share-able the research methods used for my dissertation on nearshore fish communities in Alaska working towards a PhD in marine biology at the University of Alaska Fairbanks. Here, I cover steps of the data preparation for my second and third chapters concerning spatial and temporal distributions of nearshore fishes across the state. This and other files can be accessed via the Kachemak Bay National Estuarine Research Reserve's github [nearshore fish repository](https://github.com/kbnerr/nearshore-fish).

In this document I share exploratory data analyses conducted on the NOAA Nearshore Fish Atlas (NFA) database. The NFA database can be found here, <https://alaskafisheries.noaa.gov/mapping/sz/>. Namely, this document contains the steps taken after cleaning/wrangling raw data. In particular I produce various visualizations of the data in space and time. Initial wrangle steps can be viewed at this [Rpubs page](https://rpubs.com/chguo1/1188614) and its data objects are sourced below (file "NFA.rda").

## Set up

Load required packages, define directory, set options, source/load files:

```{r results = 'hide'}
# Packages
library(tidyverse)
library(lubridate)
library(here)
library(leaflet)
library(RColorBrewer)

# Directory
wd = here()
dirs = wd %>% list.files() %>% str_subset(pattern = "^README|^LICENSE|.md$|.Rproj$", negate = TRUE)
for (i in seq_along(dirs)) {
  name = str_replace_all(dirs[i], "^", "dir.")
  path = str_replace_all(dirs[i], "^", str_c(wd, "/"))
  assign(name, path)
  rm(name, path, i)
}

# Options

# Source/Load
load(file.path(dir.data, "NFA_wrangled.rda")) # wrangled data
```

## Background and objectives

Based on my own research and related literature, I can broadly interpret these beach seine data in a spatiotemporal context. Depending on when (time of year) and where (location and habitat), I have a general structure in mind of the diversity of the community. Depending on who's there, I have a rough guess of the relative abundance of each community member (at least for the more commonly caught ones). My hope with working with the NFA data is that this broad understanding can be formally tested with inference and statistics, and that more informative research questions can be answered.

Findings from other nearshore fish researchers largely agree that seasonality (or some related environmental condition) is a strong predictor of community or species presence. Interannual differences can exhibit wide variability and should be accounted for whenever possible. And at larger time-scales (e.g., decadal), communities exhibit changes which may relate to regime shifts occurring more broadly than the nearshore.

Spatially, large-scale (regional) effects appear to be as or more important than sub-regional or local-scale effects. However, many studies also find significant relationships in community response at these smaller scales. Likely, the most appropriate spatial consideration depends on the question being asked. So maybe a more appropriate question to ask of the NFA data is if there any significant spatial scales discernible in the data, and also how should we address those scales in future research, such as studies on subsets of the taxa or habitat management considerations.

I'll start by exploring the structure of our 'visits' dataframe, and then I'll move on to visualizing the information derived from out 'catch' dataframe.

# Exploring visit data

## Summary of visits

Remember that in the data wrangling stage, we created a new sample identifier based on site and date which we called VisitID. Let's take a look at the variables associated with each VisitID:

```{r}
glimpse(visits)
```

During the wrangling stage, I decided that Replicates and MeshSize should be the main covariates that needed consideration. At this point they are classed as integers/numbers, but we will probably want them to be factors instead. Cluster is a byproduct of clustering some sites together, so it may not be needed if we're using Lat/Lon as our primary explanatory variable. Still, it could be a useful grouping factor later on. Date will likely be mutated into multiple additional periods (day of year, month, year) to see if any of those factors are useful in explaining fish data. We'll address edits to these variables as they come up in our exploration.

## Spatial distribution

```{r}
leaflet(visits) %>% 
  addTiles(options = tileOptions(minZoom = 3.5,
                                 zIndex = 0.5)) %>%
  setView(lat = 65, lng = -152, zoom = 3.5) %>%
  addSimpleGraticule(interval = 5) %>%  
  addCircleMarkers(lng = ~Lon, lat = ~Lat,
                   stroke = FALSE,
                   popup = ~VisitID,
                   clusterOptions = markerClusterOptions())
```

In the map above we can view all of our samples from across Alaska. Leaflet options are nice because you can optionally group samples on the map and show a boundary polygon of the area they cover when you hover over their icon. At the minimum zoom level (3.5), we see the largest group spanning nearly all of Southeast Alaska, followed by a group in Southcentral that covers Prince William Sound, Cook Inlet, and the northern half of Kodiak Island. The third largest group forms a much smaller polygon area around Utqiagvik. From there, we see even smaller groups from the Eastern Beaufort Sea around Kaktovik, the Bering Strait and Southern Chukchi Sea near Kotzebue, and Bristol Bay. There are also two groups from the Alaska Peninsula and Aleutian Islands, one focused around Adak and another spanning from Sand Point to Unalaska. We can see how subsets of the samples change by zooming in. When there are single samples in view, hovering over those will tell us its VisitID name.

Obviously, this visualization of the data is not perfect and pretty rough- note that the groups are formed by the overlap of samples based on pixel radius (10). For example, the polygon of samples from Bristol Bay at the minimum zoom level actually crosses the Alaska Peninsula to include a sample from Aghiyuk Island. Although, the map is useful in gaining a general understanding of the existing clusters of samples within the NFA database.

## Regions

We may want to define a spatial factor based on these rough groupings as a starting place to see if the smaller, more isolated groups should be lumped with other groups or not. Actually, the NFA already has a Region classifier per site, but the reason I do not want to use it is because they linked Region to SiteID. When we created our new VisitID, some samples now contained two Region labels- these were mostly cases from around Utqiagvik where some were labelled as Chuckchi and others as Beaufort but actually occurred on the same day and within a very short distance of each other.

Instead of using the the given regional classes, let's just make our own variable called 'Region' and add it to the our visits df:

```{r}
# Start with events so that we can make use of Location info for unique cases,
regions = events %>%
  select(EventID, VisitID, Lat, Lon, Location) %>%
  mutate(Region = case_when(Lat > 69 & Lat < 80 & Lon > -150 & Lon < -140 ~ "Beaufort East",
                            Lat > 69 & Lat < 80 & Lon > -160 & Lon < -150 ~ "Chukchi/Beaufort",
                            Lat > 65 & Lat < 69 & Lon > -170 & Lon < -160 ~ "Chukchi South",
                            Lat > 57 & Lat < 60 & Lon > -165 & Lon < -155 ~ "Bristol Bay",
                            Lat > 50 & Lat < 55 & Lon > -180 & Lon < -175 ~ "Aleutians Adak",
                            Lat > 50 & Lat < 55 & Lon > -170 & Lon < -165 ~ "Aleutians Unalaska",
                            Lat > 55 & Lat < 65 & Lon > -155 & Lon < -145 ~ "GOA Southcentral",
                            Lat > 53 & Lat < 61 & Lon > -148 & Lon < -130 ~ "GOA Southeast"))

# Check which samples we missed,
filter(regions, is.na(Region))

# Let's address those using Location,
regions = mutate(regions,
       Region = case_when(!is.na(Region) ~ Region,
                          is.na(Region) & str_detect(Location, "Yunaska") ~ "Aleutians Yunaska",
                          is.na(Region) & str_detect(Location, "Aghiyuk") ~ "Aleutians Aghiyuk",
                          is.na(Region) & str_detect(Location, "Shumagin") ~ "Aleutians Shumagin"))

# Check again (should be 0),
filter(regions, is.na(Region)) %>% nrow()
```

Now let's add Region as a factor to our visits df, and show it on a map. Since we're manipulating our data objects again, let's also rename visits so that we can keep track of our changes.

```{r}
# Call our wrangled df version visits.0 and remove the non-version 'visits' if it's there,
if ("visits" %in% ls()) assign("visits.0", visits)
if ("visits" %in% ls()) rm(visits)

# Make Region a factor and join by VisitID,
visits.1 = regions %>%
  select(VisitID, Region) %>%
  distinct() %>%
  mutate(Region = as.factor(Region)) %>%
  left_join(visits.0, ., by = "VisitID")

# Function for regional palette
palette.region = colorFactor(palette = brewer.pal(n = visits.1$Region %>% n_distinct(),
                                                  name = 'Spectral'),
                             domain = factor(visits.1$Region))

# Map
leaflet(visits.1) %>% 
  addTiles() %>%
  addCircleMarkers(lng = ~Lon, lat = ~Lat,
                   color = ~palette.region(Region),
                   fillColor = ~palette.region(Region),
                   popup = ~paste(Region, VisitID),
                   stroke = FALSE) %>%
  addLegend(position = "topright",
            pal = palette.region, values = ~Region,
            title = "Region",
            opacity = 0.75)
```

```{r echo = FALSE}
# Number of samples per Region
select(visits.1, VisitID, Region) %>%
  count(Region) %>%
  distinct() %>%
  ggplot(., aes(x = Region, y = n, fill = Region)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.5) +
  scale_fill_brewer(palette = "Spectral") +
  labs(x = NULL, y = NULL, title = "No. samples per Region") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
```

Like we saw before, the majority of our samples are located in the Gulf of Alaska (GOA), namely Southeast (SEAk) and Southcentral (SCAK). If we were to look at differences in catach in the GOA, it'd be cool if we could include the Aleutian samples too. Obviously, we'll want to address those few instances from Aghiyuk, Shumagin, and Yunaska, and ideally we aggregate them into one or two groups. We also see relatively small sample sizes from Bristol Bay and from East Beaufort, which may or may not be prohibitive in conducting meaningful tests. One other consideration is the couple of samples along the Bering Strait coast which are lumped together with a larger concentration of samples from South Chukchi.

Some of our first analyses should see if the more isolated samples are different from the other samples. If not, then we can aggregate. If yes, then we should report any significant tests.

## Temporal distribution

Next we'll want to take a look at how our samples distribute throughout various time periods. This will give us a good idea of how well our samples may or may not overlap- particularly, we'll want to see overlaps in years and months for better comparisons of the catch data.

```{r}
# First create a df specifically containing at time variables
times = visits.1 %>%
  select(VisitID, Date) %>%
  mutate(Year = year(Date),
         Month = month(Date, label = TRUE),
         Week = week(Date),
         Day = yday(Date))

# Year
times %>%
  count(Year) %>%
  ggplot(., aes(x = Year, y = n)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.5, color = "blue") +
  labs(x = NULL, y = NULL, title = "No. samples per Year")

# Year without 1976
times %>%
  filter(Year > 1980) %>%
  count(Year) %>%
  ggplot(., aes(x = Year, y = n)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.5, color = "blue") +
  labs(x = NULL, y = NULL, title = "No. samples per Year without 1976")

# Month
times %>%
  count(Month) %>%
  ggplot(., aes(x = Month, y = n)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.5, color = "blue") +
  labs(x = NULL, y = NULL, title = "No. samples per Month")

# Week
times %>%
  count(Week) %>%
  ggplot(., aes(x = Week, y = n)) +
  geom_col() +
  geom_text(aes(label = n), hjust = -0.15, angle = 90, color = "blue") +
  labs(x = NULL, y = NULL, title = "No. samples per Week")

# Day of Year
times %>%
  count(Day) %>%
  distinct() %>%
  ggplot(., aes(x = Day, y = n)) +
  geom_col() +
  labs(x = NULL, y = NULL, title = "No. samples per Day of Year")
```

By year, we see a large variability in number of samples, as well as a lone 1976 year which I recognize to be Blackburn's dataset from Lower Cook Inlet. That 1976 is an interesting dataset that could be useful for a look at catches over time in that specific area, but it'll need to be excluded from any other comparisons. The lowest number of samples occur in 2010, 2020, and 2021. We may want to pay close attention to these years in case they show up as outliers. It seems that 1996-2000 is a period of heavy sampling, followed by the mid 2010s and somewhat active year in 2006.

By month, we see an obvious uptick during the summer which makes sense given winter conditions along most of Alaska's coast (light, ice, etc.). March and October are interesting months because of the potential for impacts on fishes during years with warmer shoulder seasons. November thru February will likely be removed for our purposes. Our key months look to be June, July and August, with possible inclusion of April, May, and September.

It's interesting if we look at the by-week and by-day graphs, we see a dip in sampling around mid-June/early-July. Not sure why this is- maybe related to more targeting of juvenile salmonids in early June, or maybe so reduced effort around the 4th of July holiday.

## Year & month by region

Let's now see how our samples overlap by Region and Year, and by Region and Month. For these purposes, let's combine the Aleutians into one group for now, and remove the 1976 data.

```{r}
# Samples by Year and Region,
left_join(filter(times, Year > 1980),
          select(visits.1, VisitID, Region), by = "VisitID") %>%
  mutate(Region = case_when(str_detect(Region, "Aleutians") ~ "Aleutians",
                            !str_detect(Region, "Aleutians") ~ Region)) %>%
  ggplot(aes(x = Date, y = Region, color = Region)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -1.9, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               }) +
  scale_color_brewer(palette = "Spectral") +
  scale_x_date(date_labels = "%y",
               date_breaks = "2 years") +
  labs(x = "Year", y = "Region",
       title = "Samples by Year and Region",
       subtitle = "No. samples centered over median") +
  theme_minimal() +
  theme(legend.position = "none")

# Samples by Month and Region,
left_join(filter(times, Year > 1980),
          select(visits.1, VisitID, Region), by = "VisitID") %>%
  mutate(Region = case_when(str_detect(Region, "Aleutians") ~ "Aleutians",
                            !str_detect(Region, "Aleutians") ~ Region)) %>%
  ggplot(aes(x = Month, y = Region, color = Region)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -1.9, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               })+
  scale_color_brewer(palette = "Spectral") +
  labs(x = "Month", y = "Region",
       title = "Samples by Month and Region",
       subtitle = "Excluding 1976; no. samples labelled by month") +
  theme_minimal() +
  theme(legend.position = "none")
```

Interesting that each region's samples center on a different year. At a glance, I see the most overlap in 2006 where the Aleutians and both GOA regions look to have similar sample densities, plus there is a smattering of Chukchi/Beaufort sampling too. In 2009, the Bristol Bay samples may be compared with both GoA regions and the Chukchi/Beaufort, although the samples look more sparse across the regions. And in 2013, the Chukchi/Beaufort has good overlap with both GOA regions again. I see nice periods of comparison between SEAK and SCAK, but there is also a concerning difference in yearly distribution of samples between the two regions. Depending on how important year effects appear- we may need to trim the data for better comparisons.

By month, we see that August contains samples from all regions. June is the center for SEAK and the Aleutians, July for SCAK, and August for Chukchi/Beaufort. Best comparisons from this monthly view may be an August (or lumped July + August) comparison of all Regions. Also, a June (or lumped June thru August) comparison of GOA and the Aleutians would be good. And, we should be able to nicely compare SEAK and SCAK from May to September, potentially including April.

For now, we'll hold off on excluding any of the data. It'll be better to revisit these considerations after deciding which research questions to tackle, so subsets of the data can be made specific to each one.

## Mesh Size and Replicates

Let's make similar visuals as before, but now focus on Mesh Size and Replicates in space and time. First, we can re-map the samples to color by Mesh Size and size of points relative Replicates (large dots = more replicates).

```{r}
# Palette function to color markers
palette.mesh = colorFactor(palette = brewer.pal(n = visits.1$MeshSize %>% n_distinct(),
                                                name = 'Spectral'),
                           domain = factor(visits.1$MeshSize))
# Map
leaflet(visits.1) %>% 
  addTiles() %>%
  addCircleMarkers(lng = ~Lon, lat = ~Lat,
                   radius = ~(sqrt(Replicates*15)),
                   color = ~palette.mesh(MeshSize),
                   fillColor = ~palette.mesh(MeshSize),
                   popup = ~VisitID,
                   stroke = FALSE) %>%
  addLegend(position = "bottomright",
            pal = palette.mesh, values = ~MeshSize,
            title = "Mesh Size (mm)",
            opacity = 0.75)
```

I have an issue discerning among the seven different sizes since those warmer colors do not diverge well. Although there are some patterns to make note of: the largest mesh size of 12.7 only occurs in a small area of SEAK, the 10 mm samples similarly only occur with the 12.7 mm samples (likely the [Lundstrom et al. 2022](https://doi.org/10.1007/s12237-022-01057-x) gear comparison), and the smallest size at 3 mm only occurs in Cook Inlet. I don't see any obvious spatial patterns in Replicates, although this may not be the best view of it.

Let's see the frequency of samples for both variables:

```{r}
# Table view
table(visits.1$MeshSize)
table(visits.1$Replicates)
```
There are more 12.7 mm samples than I would have guessed which is a shame because if they are completely different then we lose a lot of samples. I am wondering if sizes are accurate- I would think that 3 mm and 3.2 mm are effectively similar in practice. I am also wondering if some projects report, say 3 mm instead of 3.2 mm, because of rounding. I would assume that the nets used were manufactured in the US, which would mean that specifications are given in standard not metric. Let's do a quick check on how typical mesh sizes convert from inches to millimeters.

```{r}
# Function based on inch to millimeter conversion
in_to_mm = function (x) {
  return(y = 25.4 * x)
}

# Typical mesh sizes in inches
sizes = c((1/8), (1/4), (3/8), (1/2), (5/8), (3/4))

# Convert to millimeters
in_to_mm(sizes)
```

There is obviously some funny rounding happening in the data. I do not see an issue by lumping mesh sizes that are within one millimeter. Even if the nets were actually different by fractions of a millimeter, I don't think it would affect the selectivity of fishes all that much- especially considering that most of these researchers were capturing fish to be measured to the nearest millimeter.

Let's see that map again with this in mind:

```{r}
# Palette function for binned sizes
palette.mesh.bin = colorBin(palette = brewer.pal(n = 4, name = 'Spectral'),
                            domain = visits.1$MeshSize %>% unique(),
                            bins =  c(0, 4, 7, 11, 13))
# Map
leaflet(visits.1) %>% 
  addTiles() %>%
  addCircleMarkers(lng = ~Lon, lat = ~Lat,
                   radius = ~(sqrt(Replicates*15)),
                   color = ~palette.mesh.bin(MeshSize),
                   fillColor = ~palette.mesh.bin(MeshSize),
                   popup = ~VisitID,
                   stroke = FALSE) %>%
  addLegend(position = "bottomright",
            colors = c(palette.mesh.bin(3),
                       palette.mesh.bin(6),
                       palette.mesh.bin(9),
                       palette.mesh.bin(12)),
            values = ~MeshSize,
            title = "Mesh Size (mm)",
            opacity = 0.75,
            labels = c("3 or 3.2", "6 or 6.4", "9.5 or 10", "12.7"))
```

Much easier to read! I see some concerning patterns we'll want to investigate further as we go. The majority of samples appear to be using the 3.2 mm size, including both Arctic groups that seem to use it exclusively. There is a large density of 6.4 mm samples in Northern SEAK and throughout the Aleutians. The same issues are visible concerning the two larger mesh sizes. We should do a comparison of 3.2 mm and 6.4 mm to see if there are any significant differences in catch.

With the simpler color scheme, I am also noticing the difference in Replicates a little easier. Just looking at GOA and Aleutian samples, I see that many of the 6.4 mm samples are also low-Replicate samples. Although, many of our samples only contain one replicate so this may just be something I'm seeing and not actually a concern.

Let's take a look at how these variables in graphical format. I'll go ahead and aggregate similar mesh sizes and re-classify the both variables as factors, and add in variables for year and month.

```{r}
visits.2 = mutate(visits.1,
                  Replicates = as_factor(Replicates),
                  # Combine similar mesh sizes to the tenth decimal
                  MeshSize = case_when(MeshSize < 4 ~ 3.2,
                                       MeshSize > 4 & MeshSize < 7 ~ 6.4,
                                       MeshSize > 7 & MeshSize < 11 ~ 9.5,
                                       MeshSize == 12.7 ~ 12.7) %>%
                    as.factor()) %>%
  # Add temporal variables
  left_join(select(times, VisitID, Year, Month), by = "VisitID")
```

```{r}
# Samples by MeshSize and Year,
filter(visits.2, Year > 1989) %>%
  ggplot(aes(x = Date, y = MeshSize, color = MeshSize)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -3, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               }) +
  scale_color_brewer(palette = "Spectral") +
  scale_x_date(date_labels = "%y",
               date_breaks = "2 years") +
  labs(x = "Year", y = "Mesh Size (mm)",
       title = "Samples by Year and Mesh Size",
       subtitle = "Excluding 1976; no. samples centered over median") +
  theme_minimal() +
  theme(legend.position = "none")

# Samples by Month and MeshSize,
filter(visits.2, Year > 1989) %>%
  ggplot(aes(x = Month, y = MeshSize, color = MeshSize)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -3, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               })+
  scale_color_brewer(palette = "Spectral") +
  labs(x = "Month", y = "Mesh Size (mm)",
       title = "Samples by Month and Mesh Size",
       subtitle = "Excluding 1976; no. samples labelled by month") +
  theme_minimal() +
  theme(legend.position = "none")

# Samples by Replicates and Year,
filter(visits.2, Year > 1989) %>%
  ggplot(aes(x = Date, y = Replicates, color = Replicates)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -1, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               }) +
  scale_color_brewer(palette = "Spectral") +
  scale_x_date(date_labels = "%y",
               date_breaks = "2 years") +
  labs(x = "Year", y = "Replicates",
       title = "Samples by Year and Replicates",
       subtitle = "Excluding 1976; no. samples centered over median") +
  theme_minimal() +
  theme(legend.position = "none")

# Samples by Replicates and Month,
filter(visits.2, Year > 1989) %>%
  ggplot(aes(x = Month, y = Replicates, color = Replicates)) +
  geom_boxplot() +
  geom_jitter(size = 0.5, alpha = 0.75) +
  stat_summary(geom = "text", vjust = -1, color = "black",
               fun.data = function(x) {
                 return(c(y = median(x), label = length(x)))
               })+
  scale_color_brewer(palette = "Spectral") +
  labs(x = "Month", y = "Replicates",
       title = "Samples by Month and Replicates",
       subtitle = "Excluding 1976; no. samples labelled by month") +
  theme_minimal() +
  theme(legend.position = "none")
```

